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Abstract 

The fast Fourier transform (FFT) is undoubtedly an essential primitive that 
has been applied in various fields of science and engineering. In this paper, 
we present a decomposition method for parallelization of multi-dimensional 
FFTs with smallest communication amount for all ranges of the number of 
processes compared to previously proposed methods. This is achieved by two 
distinguishing features: adaptive decomposition and transpose order aware- 
ness. In the proposed method, the FFT data are decomposed based on a 
row-wise basis that maps the multi-dimensional data into one-dimensional 
data, and translates the corresponding coordinates from multi-dimensions 
into one-dimension so that the resultant one-dimensional data can be di- 
vided and allocated equally to the processes. As a result, differently from 
previous works that have the dimensions of decomposition pre-defined, our 
method can adaptively decompose the FFT data on the lowest possible di- 
mensions depending on the number of processes. In addition, this row-wise 
decomposition provides plenty of alternatives in data transpose, and different 
transpose order results in different amount of communication. Wc identify 
the best transpose orders with smallest communication amounts for the 3-D, 
4-D, and 5-D FFTs by analyzing all possible cases. Given both communi- 
cation efficiency and scalability, our method is promising in development of 
highly efficient parallel packages for the FFT. 
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1. Introduction 

Since the pioneering work of J. W. Cooley and J. W. Tukey in 1965 pQ, 
the fast Fourier transform (FFT) has become an essential kernel for numer- 
ous fields of science and engineering, such as electronic structure calculations, 
digital signal processing, medical image processing, communications, astron- 
omy, geology, optics, and so on |2l [3l HI |5] . In those apphcations, the FFT is 
usually performed with a large data set in multiple dimensions many times, 
making the FFT a computationally expensive calculation. Given the impor- 
tance of the FFT and widespread of massively parallel supercomputers of 
multi-core processors, it should come as no surprise that there have been a 
lot of efforts to parallelize the FFT that can be roughly categorized into two 
approaches: the direct approach for one-dimensional (1-D) FFTs, and the 
transpose approach for multi-dimensional FFTs. In the former approach, the 
1-D FFT is parallelized by dividing the original problem into sub-problems 
recursively, which are assigned to the processes for solving in parallel [6[ El [8] . 

In the latter approach for multi-dimensional FFTs, the FFT data are de- 
livered to the processes so that they can have enough data to perform the 
sequential 1-D FFTs locally in one specific dimension in parallel, for example 
with the wildly popular FFTW package [9J. The primary concern of this pa- 
per is the communication efficiency of the latter approach, as the data must 
be transposed, repeatedly if necessary until the data for all the dimensions 
have been FFT-transformed, for conducting the 1-D FFTs in other dimen- 
sions. The data transpose requires communications among the processes, 
and must be considered carefully, whereas the 1-D FFT is independent and 
communication-free. As a result, the communication efficiency heavily de- 
pends on the domain decomposition method, which should minimize the 
amount of communication, while not sacrificing scalability. 

The domain decomposition can be carried out in any degree from one- 
to M-dimensions for the M-dimensional (M-D) FFT, facing the trade-off 
between communication efficiency and scalability. The lower the degree of 
the domain decomposition is, the higher the communication efficiency is, and 
the lower the scalability is. That said, the 1-D decomposition achieves the 
highest communication efficiency with the lowest scalability, in contrast to 



2 



the M-D decomposition, which carries the lowest communication efficiency 
with the highest scalabihty. 

Let us take the most popular three-dimensional (3-D) FFT as an exam- 
ple, where the domain decomposition method exists in three forms: one-, 
two-, and three-dimensions. The alphabet hereafter is used to denote the di- 
mensions, for example, a for the first dimension, h for the second dimension, 
etc. The 1-D decomposition [Zmn], or slab decomposition, divides the 3-D 
data into equal blocks of complete a6-planes, for instance, along the c-axis, 
and requires only one transpose step with the smallest amount of commu- 
nication, but the number of processes applicable is limited to the size of 
one dimension. The 3-D decomposition, or volumetric decomposition, parti- 
tions the 3-D data along all three dimensions pTl [T2l [T3l E], and therefore 
three data transpose steps are necessary to perform the 1-D FFTs along the 
three dimensions, making the 3-D decomposition the costliest in terms of 
communication amount, yet the most scalable, as the maximum number of 
processes is equal to the size of the 3-D data. Sandwiched between the 1-D 
and 3-D decompositions is the 2-D decomposition (pencil or rod decomposi- 
tion) [151 CBl HZl HE] that draws smaller amount of communication than the 
3-D decomposition, and offers higher scalability than the 1-D decomposition, 
as the number of processes applicable is now up to the size of 2-D plane. 

Our work is motivated by the fact that the aforementioned domain de- 
composition methods usually pre-define the dimensions of decomposition, 
regardless of the number of processes. In particular, the 2-D decomposi- 
tion partitions in two dimensions, even when the number of processes is less 
than the size of one dimension and it is possible to decompose in only one 
dimension to reduce the communication amount, because lower degree of 
decomposition leads to smaller amount of communication. As can be ex- 
pected, the situation is worse for the 3-D decomposition, which works in 
three dimensions paying no attention to the number of processes. For bet- 
ter communication efficiency, this factor should be taken into account in the 
decomposition method. 

In addition, we observe that the order of data transpose has no effect 
upon the communication efficiency in those decomposition methods, as they 
treat the dimensions involved in the decomposition in the same way. For 
example, as mentioned above, the 2-D decomposition divides the afe-plane 
evenly, i.e., the a- and 6-axes are the same and the processes are allocated 
approximately equal numbers of data points along these two axes. In the 
two subsequent transpose steps, there is no difference in doing ac or he ffist. 
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because they all result in the same amount of communication. However, 
we find that if the dimensions involved in the decomposition are handled 
differently, specifically the data points on the afe-plane are divided to the 
processes in ascending order of their a- and then 6-coordinates, the transpose 
order will become a key factor in the subsequent transpose steps. Choosing 
the right transpose order will actually reuse a lot of data from the previous 
step and consequently reduce a substantial communication amount. 

Another motivation is that while the decomposition method for parallel 

3- D FFTs has been extensively investigated, little work has explored beyond 
them so far. In fact, 4-D and 5-D FFTs have various applications, and 
examples can be found in fields such as lattice quantum chromo dynamics 
simulations, where a Landau gauge fixing algorithm is implemented using the 

4- D FFT [19], in medical image processing with a 4-D FFT-based filtering 
[20] . in photography [211 [22], in drug design with 5-D FFT-based protein- 
protein docking algorithms [23l IMl |25], and others [26]. Hence, there is a 
real need to develop parallel M-D FFTs beyond 3-D FFTs. 

In this paper, we present a decomposition method for parallelization of 
multi-dimensional FFTs with two distinguishing features: adaptive decom- 
position and transpose order awareness for achieving smallest communica- 
tion amount compared to previously proposed methods. In our method, 
the FFT data are decomposed based on a row-wise basis that maps the M- 
D data into 1-D data, and translates the corresponding coordinates from 
multi-dimensions into one-dimension for equally dividing and allocating the 
resultant 1-D data to the processes. As a result, our method can adaptively 
decompose the FFT data on the lowest possible dimensions depending on the 
number of processes so that the communication amount can be minimized 
in the first place, differently from previous works that have fixed- dimensions 
of decomposition. In particular, the method decomposes in one dimension if 
the number of processes is less than or equal to the size of one dimension, 
in two dimensions if the number of processes is greater than the size of one 
dimension and less than or equal to the size of two dimensions, up to M- 
dimensions. Another unique feature of our method is the awareness of the 
transpose order. The row-wise decomposition provides plenty of alternatives 
in data transpose, and different transpose order results in different amount 
of communication. By analyzing all possible cases for transpose order, we 
find out the best transpose orders with smallest communication amounts for 
3-D, 4-D, and 5-D FFTs. Finally, our method is generalized to M-D FFTs. 

The remainder of the paper is organized as follows. Section 2 describes 
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M-D FFTs. The domain decomposition method is presented in Section 3, 

and Section 4 gives comparison in terms of communication amount between 
our method and other methods. Our study is concluded in Section 5. 

2. Multi-dimensional FFTs 

We start the discussion by introducing the M-D FFT by way of the 1- 
D FFT. The 1-D FFT transforms a 1-D data X{j) of N complex numbers 
{X{0),X {!),..., X{N-1)) into another 1-D d&t&X{k) of complex numbers 
{X{0),X{1), X{N - 1)) as follows: 

N-l 

X{k)^J2^NX{j), (1) 

3=0 

where = e"^'^*/^, i = 1, and the factor is dropped for simplicity. 

The 1-D FFT transforms the 1-D data that has exactly one discrete vari- 
able j into another 1-D data structure. Similarly, the M-D FFT transforms 
an M-D data X{ji,j2, ■■■tJm) that has M discrete variables ji, j2, ■■■tJm into 
another M-D data X{ki,k2, ku), defined by 

iVi-l / N2-I / Nm-I 

ii=o \ 02=0 \ jM=o 

(2) 

where ojNr — e and 1 < r < M. 

As can be seen from the above equation, the M-D FFT can be computed 
in M single steps, with each performing the 1-D FFTs along one specific 
dimension. This is a crucial feature exploited by the domain decomposition 
method for parallelization of the M-D FFT. The first step conducts the 
1-D FFTs along the last dimension, for instance, followed by a transpose 
operation. The second step executes the 1-D FFTs along the dimension 
prior to the last dimension, and so on. Lastly, the Mth step carries out the 
1-D FFTs along the first dimension, ending the M-D FFT. 

3. Domain decomposition method 

In this section, we present our domain decomposition method, starting 
with the 3-D FFT for ease of understanding. We then provide a general 
description of the method for the M-D FFT, and finally we describe the 
method for the 4-D and 5-D FFTs, and beyond them. 

5 




3.1. 3-D FFTs 

Here we illustrate and examine our decomposition method for the 3-D 
FFT. Assume that the numbers of data points along the a-, 6-, and c-axes 
are iVi, and iVa, respectively, the number of processes is Np, and myid is 
the process identification. 

Our method is basically based on a row- wise decomposition. It first trans- 
lates the original 3-D FFT data X^Y){a, b, c) into a 1-D data Xid{x), as illus- 
trated in Fig. [1] for the abc order as an example in a total of 3x2x1 = 6 
orders. The relationship between a 3-D coordinate X3£)(ai, 61, Ci) and a 1-D 
coordinate XiD(a;i) in the abc order is given by 



Xi = ai X N2 X N3 + bi X N3 + ci. 



(3) 



It is worth mentioning that the relationships in other orders different from 
abc can be derived in the same way. 



Ni 



aoboCo 



Three-dimensional data 
to one-dimensional data 
aobiCo 



ajboCo 



aoboCA3-i aobiCjvj-i aobN2-iCA'3-i aiboCAy-i aM-ibN2-iCAg-i 

A^3 data points 

1 

I N2 >^Nj, data points I 

TVj xjV2 xA^3 data points divided equally to A^p processes 

Figure 1: 3-D FFTs: 3-D data to 1-D data mapping in row- wise decomposition for the 
abc order. 

In order to identify the starting and ending points allocated to each pro- 
cess in the domain decomposition determined by Np, myid, Ni, N2, and N^, 
we define a function /sdO for translating the 3-D and 1-D coordinates as 
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follows: 



f3D{Ni,N2,Ns,Np,myid) 



Ni X myid 



NiN2Xmyid 



X Ns, if Ni<Np< N1N2; 



N\N2Nzxmyid 



if A^iA^2 <Np< N1N2N3, 

(4) 

where [J is the floor function. Our method then equally divides the resultant 
1-D data to the processes, in which a process with myid is assigned the data 



points from Xinix^^yi^) to Xir,{ 



X 



myid J 



in one dimension, where 



^myid = /3d(A^i, N2, N^, Np, myid), 
^myid = /3d(A^i, N2, Ns, Np, myid + 1) - 1. 



(5) 
(6) 



These 1-D coordinates can be translated back to the 3-D ones to obtain the 
corresponding starting and ending coordinates in three dimensions as 



j(«.e) 
myid 



X 



(s,e) 
myid 



X 



{s,e) 
myid 



o^dN2N, 



(7) 



(8) 



'"myid 



X. 



(s,e) 
myid 



o^dN2N, 



ds,e) 



^myid-'-^'i^ 



where X3D(a: 



hi 



myidi myidi myid 



and X3D(a^w,&: 



myid"! myidi ^myid 



(9) 

are the starting 

and ending points, respectively, in three dimensions for a process with myid. 

Consequently, the decomposition has three forms depending on the num- 
ber of processes. The distribution of data points is carried out in 1-D, 2-D, 
or 3-D data defined by the first one, two, or three dimensions, respectively. 
For example, with Np processes and Ni < Np < N1N2, the decomposition in 
the abc order takes place in the 2-D decomposition, where the data points 
on the afo-plane with the c-axis are divided over the processes in ascending 
order of their a- and 6-coordinates so that each process has approximately 
the same number of data points on the a6-plane. The data points extending 
along the c-axis that have the same a- and ^-coordinates are also assigned 
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to the same process. Therefore with the 2-D decomposition on the a6-plane 
in the abc order, a process with myid will be assigned the data points from 



myid^ bmyid^ ^3 " 1) in ascending order of the a-, 



b-, and c-coordinates, where a^^^^, 6^^^^^, a^j,^^, and 6^^^^ can be obtained 



from Eqs. [7] and |8} with 



X 



myid 



N1N2 X myid 



X 



myid 



N1N2 X {myid + 1) 



X N3, 



X A^, - 1. 



(10) 



(11) 



In subsequent transpose steps, the decomposition can occur in any order of 
combination of the three dimensions a, b, and c. 

Figure [2] exemplifies the operation of our method with transpose-order 
awareness (a) and transpose-order unawareness (b), followed by a conven- 
tional 2-D method for comparison (c 



The transpose order in Fig. 2 (a 



IS 

abc — )■ cab — )■ cba, and in Fig. 2(b) abc — )■ cab — )■ bca. Even though the 
only difference between them is the last decomposition, cba as against bca, 
this has far-reaching implications for the amount of reused data, because a 
majority of data can be reused with the transpose from cab to cba, while 
the transpose from cab to bca leaves only a minority of data that can be 
reused. For instance, with process PI, Fig. 2(a) shows a large overlap be- 



tween the areas assigned to it in cab and cba, implying that a large amount 
of data can be reused, whereas the overlap between cab and bca is small (Fig. 
2(b)[ ). In fact, as revealed later in Fig. 



3(a), the amount of communication 



with transpose-order unawareness, abc — )■ cab — )■ bca, is doubled compared to 
transpose-order awareness, abc — cab — cba. On the other hand, the con- 
ventional 2-D decomposition is applied to two dimensions that are treated 
in the same way so that the processes have approximately equal numbers 
of data points on these two dimensions, leaving no difference between cba 
and bca, and eventually no effect of the transpose order. Also, though the 
illustrations are intended for 2-D decomposition, the extension to 1-D and 
3-D decompositions is straightforward. 



Figure 3(a) shows a full analysis on the amount of communication to the 



number of processes with all 8 possible cases of transpose order for the 3- 
D FFT, with N x N x N data points for the sake of simplicity and clear 
presentation. Interestingly, the amount of communication is grouped into 
only two patterns of communication in our method, and there is one pattern 
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abc cab cba 




(a) Transpose-order awareness: transpose from cab to cba, a majority of data can be reused. 



abc cab bca 




(b) Transpose-order unawareness: transpose from cab to bca, only a minority of data can be 
reused. The amount of communication is doubled compared to (a). 




(c) Conventional 2-D decomposition: all dimensions are the same. 



Figure 2: 3-D FFTs with 2-D decomposition: our method with transpose-order awareness 
(a) and unawareness (b), and regular 2-D method (c). 
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abc 



Np! Number 
of processes 



cba 



bca 
_l_ 



cab 



acb 



NxNxN grid cab acb 


acb 


cab cba bca 


cba bca 


Np < N N^-N^/N, 2(N^-N^/Np) 


2(N'-N^/Np) 


2(N'-N^/Np) N^-N^/Np 2(N^-N^/Np) 


N^-N^/Np N'-N'/Np 


N < Np < (2-N/Np)N'-N' 2N'(N-1) 


2N^(N-1) 


2N^(N-1) (2-N/Np)N'-N^ 2N'(N-1) 


(2-N/Np)N'-N' (2-N/Np)N'-N^ 


Np=N^ 2N^(N-1) 2N^(N-1) 


2N^(N-1) 


2N^(N-1) 2N^(N-1) 2N^(N-1) 


2N^(N-1) 2N^(N-1) 


N^<Np<N' N'+2Np(N-l) N'+2Np(N-l) 


NV2Np(N-l) NV2Np(N-l) N^+2Np(N-l) N'+2Np{N-l) 


N'+2Np(N-l) N'+2Np(N-l) 


Np= N" 3N'(N-1) 3N'(N-1) 


3N'(N-1) 


3N'(N-1) 3N'(N-1) 3N'(N-1) 


3N'(N-1) 3N'(N-1) 



(a) 8 transpose order cases and the amount of communication corresponding to the number of 
processes. 



o 
'4-' 

u 
3 



o 
u 



o 



P2: abc -> bca -> acb P2: abc -> bca -> acb 



IT 



2xt 



-) 2A^'(A^-1) 
1-1.3X: 



PI: abc -> acb -> bca PI: abc -> acb -> bca 



(2 )N'-N^ 

N.: 



PI: abc -> acb -> bca 



N 

Number of Processes 



(b) 2 patterns (PI and P2), the amount of communication corresponding to the number 
of processes, and the difference between their amount. 



Figure 3: 3-D FFTs: all 8 cases (a), categorized into 2 patterns (b). 
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actually up to twice better than the other, as shown in Fig. 3(b) The 
difference between these two patterns is twice, when the number of processes 
is up to the size of one dimension, and up to 1.3 times, when the number of 
processes is in between the sizes of one and two dimensions. The difference 
remains almost unchanged by A^. We find four transpose orders leading to 
the pattern with the smaller amount of communication for all ranges of the 
number of processes: 

abc — )■ acb — )■ bca, 
abc — 7- acb — )■ cba, 
abc — )■ cba — )■ cab, 
abc — 7- cab — )■ cba. 

In general, we can follow any of these four orders in parallelization of the 
3-D FFT. In fact, our previous work has employed the method with the 2-D 
decomposition and the transpose order of abc — )■ cab — )• cba [27] . 

3.2. General description of the method 

Let Xu-DiNi, N2, Nm) be the M-D input data. Each process is al- 
located approximately ^^^^^j^'"^^" data points. The decomposition starts 
from the first dimension and gradually moves down to the last dimension 
of the data structure to assign the data points to the processes according 
to the number of processes. The problem turns to finding the starting and 
ending coordinates of each dimension of the data structure for each pro- 
cess, specifically, the starting point XM-nixl^yid^ xl,myid^ ^M,myid) and the 
ending point XM-olx^^^^i^, ^^^i^, x^^,^^^^) that together define the data 
points distributed to a particular process with myid. This procedure can be 
generalized from the previous case of 3-D FFTs. 

After the M-D input data have been translated into the 1-D data, a 
1-D coordinate Xiy){Xi) is identified from the equivalent M-D coordinate 
XM-Ti{xi,X2, ..^Xm) as follows: 

Xi = Xi X A^2 X A^3 X ■ ■ ■ X A^M + X2 X A^3 X A^4 X ■ ■ ■ X Nm^ hXM-1 X A^m + Xm- 

(12) 
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As well as the case of the 3-D FFT, a function /m-d() for translating the 
M-D and 1-D coordinates is defined as 



A'^l X my ld 



fu-D^Ni, N2,..., Nm, Np, myid) = 

X N2N2, X • • ■ X Nm, if Np < Ni, 



NiN2Xmyid 



X A^a X • • • X A^M, if A^i < Ap < N1N2; 



NiN2X---xNMXmyid 



if N1N2 X • • • X A^M-i < Ap < N1N2 X • • 

(13) 



In one-dimension, the data points from -^^10(2;^^^^) to ^mix'^yid) allo- 
cated to a process with myid, where 



^myid = /m-d(A^i, A^2, •-, A^M, Ap, myid), 
^myid = /m-d(A^i, A"2, ■-, A^M, Np, myid + 1) - 1. 



(14) 
(15) 



Finally, in M-dimensions, the starting point XM-ii{x{ „^y^d, ^'l^mytd^ ^M,myid) 
and the ending point Xu-Dixlmyid^ xl,myid^ ^M,myid) of a process with myid 
are given by 

XV-.. ^ , (16) 

(17) 



X. 



2,myid 



(s,e) 
l,myid 

(s,e) 



N2N3 X ■■■ X N, 



M 



X, 



- 4%dN2Ns x...xNm 



N'^x-'-xN. 



M 



X N, 



M- 



X 



(s,e) 
myid 



X 



X 



(s,e) 
M- 



X N 



M 



l,myid 
•^2,myid-^^3 



N 



M 



X 



X N, 



M 



X 



{s,e) 

M-2,myid 



M 



(18) 



12 



X 




= X, 



M 



X 



M—l,myid 



X 



(19) 



With the definition of the function /m-d()) the row-wise decomposition 
realizes the first feature of our method, adaptive decomposition. This is 
how our method is adaptive and flexible to the number of processes Np. 
When Np is less than or equal to the size of the first dimension Ni, the 
method will partition the M-D data in only that dimension, and assign about 
^ X (A''2 X ... X Nm) data points to each process. Likewise, if Np is between 
A^i and Ni x N2, the decomposition will take place in the first and second 
dimensions, and distribute approximately ^'^^^^ x (A^3 x ... x Nm) data points 
to each process. As lower degree of decomposition requires smaller amount of 
communication, our method is able to minimize the communication amount 
in the first place. 

This adaptive decomposition is conducted when data transpose is per- 
formed to re-allocate the data points to the processes. The computation is 
relatively simple, and therefore its computational time is trivial. Figure 4 
outlines the parallel M-D FFT with the decomposition for data transpose. 
The order of transpose used in the figure is general, and the actual order, 
accompanied by the amount of communication, is analyzed for the 3-D, 4-D, 
and 5-D FFTs. 

Since this is a row-wise-based distribution, the method leaves a number 
of order options to be explored, because in this case, abc is no longer identical 
to cab with the 3-D FFT, as illustrated previously. The order of transpose 
plays a key role in the ability of re-using data that is directly related to the 
amount of communication. Let us calculate the total number of cases for 
the M-D FFT. Since we start with the first transpose, it has only a single 
case. In the next transpose, as we have M — 1 remaining dimensions to 
choose from and each dimension has (M — 1)! cases to be explored, there are 
(M - 1) X (M - 1)! cases in this step. Similarly, there are (M - 2) x (M - 1)! 
cases in the third transpose, because we have M — 2 remaining dimensions 
with each dimension having (M — 1)! cases. In the Mth (final) transpose, 
there are 1 x (M — 1)! cases. Multiplication of all the cases in M steps gives 
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Input: X{Ni,N2,...,Nm), Np, myid 
Output: X{Ni,N2,...,Nm) 

1 Step 1: Perform A^i x iVa x ■ ■ ■ x Nm-i 1-D FFTs, each with Nm 
points along the Mth dimension. 

2 Transpose: Conduct the adaptive decomposition. 

X'^ikM,jl,j2,--jM-l) = X^ijlj2,--jM-l:kM) 

3 Step 2: Perform A^m x A^i x A^2 x ■ ■ ■ x Nm-2 1-D FFTs, each with 
Nm-i points along the (M — l)th dimension. 

X^{kM,jl,j2,---,jM-2,kM-l) = 

jM-i=0 ^ [>^M,Jl,j2,—,jM-l) 

4 Transpose: Conduct the adaptive decomposition. 

X'^{kM, kM-l,jl,j2, JM-2 3u32, ■■■,jM-2, ku-i) 

5 ... 

6 Step M: Perform Nm x Nm-i x • • • x A^2 1-D FFTs, each with Ni 
points along the first dimension. 

X^^'-'ikM, kM-i, k2, k,) = Y.l'-^uj'tx^''-\kM, kM-i, ^2, Ji) 

7 Transpose: Conduct the adaptive decomposition. 
X(ki, k2, kM-i, kM) — X'^^~^{kM, ^m-i, ^2, ^1) 

Figure 4: Parallel M-D FFTs with adaptive decomposition for data transpose. 



US the total number of cases for the M-D FFT 

Cm-d= 1 x(M-l) X (M-l)!x(M-2) X (iH-l)!x... 

1st transpose 2nd transpose 3rd transpose 

X 2 x (M- 1)! X 1 X (M- 1)! 

V ^ ^ V ^ 

(M-l)th transpose Mth transpose 

= (M - 1) X (M - 2) X ... x 2 x 1 X (M - l)!^"^ 

= (M-1)!^ (20) 

Table 1 shows the number of order cases corresponding to the number of 
dimensions. With the 3-D FFT, there are 8 cases only, which can be examined 
manually. However, with the 4-D and 5-D FFTs, these numbers are 1,296 
and 7,962,624, respectively, that can be investigated thoroughly by computer 
simulations. Even so, it is practically difficult with 2,985,984,000,000 cases 
for the 6-D FFT. Therefore, we have no choice but to limit ourselves to the 
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Table 1: Number of cases. 

Number of dimensions Number of cases 

2 1 

3 8 

4 1,296 

5 7,962,624 

6 2,985,984,000,000 
M (M-1)!^^ 



5-D FFT, and try to generalize the results to those beyond them. 

3.3. 4-D FFTs, 5-D FFTs, and beyond 

Figure |5] depicts the patterns, the amount of communication correspond- 
ing to the number of processes, and the difference between their amount in 
different ranges for the 4-D FFT (a) and 5-D FFT (b), where the M-D data 
are assumed to have an equal size in all dimensions. Due to their complex 
nature, we do not have any illustrations for the decomposition method. As 
it is also difficult to present all possible cases, 1,296 for the 4-D FFT and 
7,962,624 for the 5-D FFT, here we only show the best and worst patterns. 

With the 4-D FFT, we find four transpose orders: 

abed — > abdc — > dcha — rfcafo, 

abed — abde — deab — )■ deba, 

abed — )■ abde — )• edab — )■ edba, 

abed — abde — > edba — edab., 

that always offer the smallest amount of communication for all ranges of the 
number of processes. The best transpose order is exactly 3 times better than 
the worst order for up to N"^ processes, and up to 1.5 times for the next range 
of [A^^,A^^]. The difference remains almost the same, apparently unaffected 
by N. 

The number of always-best transpose orders is found to be higher with 
the 5-D FFT than with the 4-D FFT: 96 orders, including 

abede — )■ abeed — )• abede — )■ eedba — )■ eedab, 
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(a) 4-D FFTs: the patterns, the amount of communication corresponding to the number of processes, 
and the difference between their amount. 



P4: abcde baced -> P51: abcde ^ cdaeb 
abdec bcdea acdeb abced -> cdbea -> abdec 



4(iV' 



P3: abcde -> abced -> 
badec -> acdeb -> bcdea 

P2: abcde abced -> 
abdec bcdea acdeb 



4x 



A/^^ P1 32: abcde ->adebc-^ 
) abced adecb bcdea 



■) 



2.1-2.7X 
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1-1.4X 



PI: abcde abedc -> 
abecd cdeab cdeba 



PI: abcde -> abedc -> 
abecd -> cdeab -> cdeba 2(N^ 
Ml 

abecd cdeab cdeba abecd cdeab ^ cdeba — — 



PI: abcde abedc -> 
abecd cdeab cdeba 



PI: abcde abedc -> PI: abcde -> abedc -> 



N„ 



) + 2N' -N' -N„ 



IT 



IT 



Number of Processes 



(b) 5-D FFTs: the patterns, the amount of communication corresponding to the number of pro- 
cesses, and the difference between their amount. 



Figure 5: 4-D and 5-D FFTs. 
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abode — )■ abced — )■ abedc — )■ ecdba ecdab, 

abode — )■ abced — abedc — )■ cdeba — > cdeab, 

abode — )■ aboed — )• abode — )■ odeab odeba. 

The gap between the best and worst orders is also higher with the 5-D FFT, 
especially with a small number of processes. With up to processes, it is 
exactly 4 times. The gap is from 2.1 to 2.7 times for the range of [A^^, A^^], 
and up to 1.4 times for the next range of [A^^,A^^]. Similar to the 3-D and 
4-D FFTs, the gap is found to be almost completely independent of N. 

Based on these observations, our decomposition method for higher di- 
mensional FFTs is thought to produce higher number of transpose orders 
that are always best for every range of the number of processes. And so is 
the gap in the amount of communication between the worst and best orders. 
Regarding the transpose order, we notice that the best orders of the 3-D 
FFT are in the form of either 1+2 or 2+1, i.e., a{bo) or {ab)o, where 2 sim- 
ply means a transpose ab — t- ba. The form of a{bo) leads to the subsequent 
transpose order of a{ob), and then {cb)a, being one of the four best orders 
for the 3-D FFT. Likewise, the form of {ab)o and its consequent orders of 
o{ab) and o{ba) are another best order. For the 4-D FFT, they follow the 
form of 2+2, {ab){od), but neither 1+3, a{bod), nor 3+1, {abo)d. Meanwhile, 
the forms of 2+3, {ab){ode), and 3+2, {abc){de), are the best combinations 
for the 5-D FFT. Consequently, we expect the form of 3+3, {abc){def), to 
deliver better, if not best, performance for the 6-D FFT. Better forms for 
higher-dimensional FFTs can be derived in the same way such that their two 
parts are the most balanced. 

4. Comparison of Communication Amount 

Figure |6] compares our proposed method with the 1-D method \lOl, 1.5-D 
method [28j, and 2-D method fl6] (a), and with the 3-D method [IT] (b), 
for the 3-D FFT. The 3-D method is displayed in a separate figure for a 
clear presentation, as the amount of communication in this case is far larger 
than the other cases. The amount of communication is calculated in case 
of 64 X 64 X 64 data points. Certainly, similar results can be produced for 
smaller and larger numbers of data points. 

The 1-D method divides the 3-D data into blocks of equal numbers of 
complete afe-planes along the c-axis, and then allocates them to the processes 
to perform the 1-D FFTs along the a- and 6-axes, followed by a data transpose 
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Figure 6: Comparison for 3-D FFTs in terms of communication amount with 64 x 64 x 64 
data points. \g 



so that each process contains complete ac-planes to carry out the 1-D FFTs 
along the remaining c-axis, with a total communication amount 



^.o^iV^'-£. (21) 

In the 2-D method, the a6-plane is evenly divided to the processes, with each 
having all the data along the remaining c-axis. The 1-D FFTs along c are 
executed, followed by two transpose steps so that the 1-D FFTs along the 
a- and 6-axes can also be performed. The 2-D method has a communication 
amount 

TV^ 3 

^D = 2iV3-2iVp(— )2. (22) 

The 1.5-D method lies between these two methods, with the amount being 
similar to that of the 1-D method, when Np < N, and the 2-D method, when 
N < Np < N'^-^. Lastly, the 3-D method partitions the 3-D data along all 
three dimensions, and requires an amount 

AsD = SN%^ - 1). (23) 



As shown in Fig. 6(a) the 1-D method is able to work only along one 
dimension and is limited to 64 processes, while the 2-D method decomposes 
the domain in two dimensions, even for fewer than 64 processes. The 1.5-D 
method offers a compromise between the 1-D and 2-D methods. By contrast, 
our method is the most flexible and adaptive, as it partitions only along one 
dimension when the number of processes is up to 64 on condition that it is a 
divisor of 64, and decomposes in two dimensions while still starting from one 
dimension for a larger number of processes with transpose-order awareness 
to reuse as many data points as possible. As a result, up to 64 processes, our 
method works in the same fashion as the 1-D and 1.5-D methods provided 
that 64 is a multiple of the number of processes, and is about 60.0% to 77.8% 
better than the 2-D method with 64 x 64 x 64 data points. From this point 
to 512 processes, the limit of the 1.5-D method, our method still has the 
edge over the 1.5-D and 2-D methods. Beyond the point of 512 processes, 
the 1.5-D method is no longer applicable, while the two other methods can 
operate until reaching the limit of 64 x 64 = 4, 096 processes for the 2- 
D decomposition. The performance gap also gradually decreases, however. 
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and eventually becomes from 2,048 processes. From 4,096 processes to 
the maximum 262,144 processes, only the 3-D decomposition is workable. 



Figure 6(b) demonstrates that our method outperforms the 3-D method, 
with the performance gain able to reach several orders of up to 11.6 (at 8,192 
processes) for the 3-D decomposition alone. 

Figure [7] extends the comparison by projecting the results obtained by 
our method and two other methods for the 4-D and 5-D FFTs, with 16 x 
16 X 16 X 16 and 16 x 16 x 16 x 16 x 16 data points, respectively, for a smaller 
number of processes. The 4-D and 5-D methods are assumed to operate in 
4-D and 5-D decompositions, respectively, akin to the 3-D method. As a 
result, the amounts of communication are 

= ^N\^ - 1) (24) 



for the 4-D method, and 

Ad = 5iV5(^ - 1) (25) 



for the 5-D method. Given these conditions, our method has a distinct 
advantage, leaving a wide performance gap of up to approximately 12 times 
for the 4-D FFT, and 11.1 times for the 5-D FFT. The gap is found to remain 
almost unchanged by A^. 



5. Conclusion 

We have presented our decomposition method for parallelization of multi- 
dimensional FFTs. The communication amount is the smallest compared to 
previously proposed methods, accomplished by adaptive decomposition and 
transpose order awareness. Featured by the row-wise decomposition that 
translates the M-D data into 1-D data and evenly divides the resultant 1- 
D data to the processes, our method can adaptively decompose the FFT 
data on the lowest possible dimensions based on the number of processes. In 
addition, our row-wise decomposition method provides a lot of alternatives 
in data transpose, among them the best communication efficient orders are 
identified and applied. We have determined the best transpose orders for 
the 3-D, 4-D, and 5-D FFTs, dependent on which we find out the way for 
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Figure 7: Comparison for 4-D and 5-D FFTs. 
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deriving the transpose orders that can dehver better performance for higher- 
dimensional FFTs. Comparison in terms of communication amount shows 
that our method is superior to other methods for the 3-D FFT, and it is antic- 
ipated to have a distinct advantage for higher-dimensional FFTs. Actually, 
the method has been employed in our open-source density functional theory 
code called OpenMX [21] . Boosting communication efficiency while not sac- 
rificing scalability, our method is promising to be harnessed in development 
of highly efficient parallel packages for multi-dimensional FFTs. 
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